Data Analyst with R Nanodegree - White Wine Analysis by Curt Hochwender

In this project, I will analyze the white wine dataset, which includes wine quality ratings and wine chemical properties. The objective of the analysis is to determine which properties influence the overall wine quality.

Univariate Plots Section

In this section, a histogram, boxplot, and summary will be provided for each wine property. This output should provide a high-level overview of the dataset.
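The statistics listed in each summary could be collected with a small helper along these lines. This is a minimal sketch, not the code behind the report: `summarize_property` is a hypothetical name, and the tiny `wine` data frame is a made-up stand-in for the real dataset.

```r
# Hypothetical stand-in for the white wine data; only two columns are shown.
wine <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0),
  alcohol       = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8)
)

# Collect the statistics used in each summary, plus the number of points
# a boxplot would flag as outliers (beyond 1.5 * IQR from the box).
summarize_property <- function(x) {
  c(IQR = IQR(x), Min = min(x), Mean = mean(x), Max = max(x), SD = sd(x),
    Outliers = length(boxplot.stats(x)$out))
}

# One row of statistics per property.
stats <- t(sapply(wine, summarize_property))
print(stats)
```

Calling `hist()` and `boxplot()` on the same columns would give the matching plots.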

Histograms

Fixed Acidity Summary

  1. Narrow IQR 1
  2. Min 3.8
  3. Mean 6.8547877
  4. Max 14.2
  5. Standard Deviation 0.8438682
  6. Slight right tail
  7. A couple of outliers

Volatile Acidity Summary

  1. Narrow IQR 0.11
  2. Min 0.08
  3. Mean 0.2782411
  4. Max 1.1
  5. Standard Deviation 0.1007945
  6. Right tail
  7. Large number of outliers

Citric Acid Summary

  1. Extremely Narrow IQR 0.12
  2. Min 0
  3. Mean 0.3341915
  4. Max 1.66
  5. Standard Deviation 0.1210198
  6. Very slight right tail, possibly a normal distribution
  7. A few outliers three times the mean

Residual Sugar Summary

  1. IQR 8.2
  2. Min 0.6
  3. Mean 6.3914149
  4. Max 65.8
  5. Standard Deviation 5.0720578
  6. One value ten times the mean, possibly incorrect data
  7. Skewed distribution

Chlorides Summary

  1. Extremely Narrow IQR 0.014
  2. Min 0.009
  3. Mean 0.0457724
  4. Max 0.346
  5. Standard Deviation 0.021848
  6. Long Right tail
  7. It will be interesting to see how chlorides impact quality

Free Sulfur Dioxide Summary

  1. Narrow IQR 23
  2. Min 2
  3. Mean 35.3080849
  4. Max 289
  5. Standard Deviation 17.0071373
  6. Right tail
  7. One severe outlier, possibly incorrect data

Total Sulfur Dioxide Summary

  1. Narrow IQR 59
  2. Min 9
  3. Mean 138.3606574
  4. Max 440
  5. Standard Deviation 42.4980646
  6. Very Slight right tail
  7. A few outliers

Density Summary

  1. Narrow IQR 0.0043775
  2. Min 0.98711
  3. Mean 0.9940274
  4. Max 1.03898
  5. Standard Deviation 0.0029909
  6. Slight right tail
  7. Again, outliers… possibly one record with incorrect data

pH Summary

  1. IQR 0.19
  2. Min 2.72
  3. Mean 3.1882666
  4. Max 3.82
  5. Standard Deviation 0.1510006
  6. Normal distribution

Sulphates Summary

  1. Narrow IQR 0.14
  2. Min 0.22
  3. Mean 0.4898469
  4. Max 1.08
  5. Standard Deviation 0.1141258
  6. Long right tail
  7. Several outliers

Alcohol Summary

  1. IQR 1.9
  2. Min 8
  3. Mean 10.514267
  4. Max 14.2
  5. Standard Deviation 1.2306206
  6. Possibly a bimodal distribution

Quality Summary

  1. Narrow IQR 1
  2. Min 3
  3. Mean 5.8779094
  4. Max 9
  5. Standard Deviation 0.8856386
  6. Normal distribution

Total Acids Summary

  1. Narrow IQR 1.02
  2. Min 4.11
  3. Mean 7.1330288
  4. Max 14.47
  5. Standard Deviation 0.8475919
  6. Slight right tail

Comments

Our white wine dataset has quality ratings from 3 to 9. If we break that into three groups (bad (3-4), average (5-7), good (8-9)), we find the majority of the wines in this dataset fall between 5 and 7, which is the range for average wines.
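The three-way grouping can be sketched with `cut()`. The quality scores below are made up for illustration, and `quality.group` is a hypothetical name that does not appear in the original analysis.

```r
# Hypothetical quality scores; the real dataset ranges from 3 to 9.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Right-closed intervals (2,4], (4,7], (7,9] map to bad, average, and good.
quality.group <- cut(quality,
                     breaks = c(2, 4, 7, 9),
                     labels = c("bad", "average", "good"))

# Share of wines in each group.
round(prop.table(table(quality.group)), 2)
```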

I found the boxplot helpful for identifying outliers. The boxplot also provides a visual for the IQR. Based on the histograms and boxplots, my guess is that the properties with a narrow IQR will impact wine quality.

It’s a little surprising to see the only property without outliers is alcohol. Is the alcohol content for wine regulated?

Univariate Analysis

What is the structure of your dataset?

This tidy dataset contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

The primary focus of this analysis is quality. Specifically, can we identify which properties have the greatest impact on the quality evaluation? Based on the definitions provided, I believe the analysis will show that the following properties have the most significant impact.

  1. Alcohol
  • the percent alcohol content of the wine
  2. Acidity
  • fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  • citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  3. Sulfur dioxide
  • free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  • total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

At this point, strictly based on gut instinct, I believe alcohol, pH, chlorides, density, and other properties will impact the quality of the wine. However, I rarely drink wine and have extremely limited knowledge of the subject. In addition, I’ll be watching properties that have a wide range between the 1st and 3rd quartiles.

Did you create any new variables from existing variables in the dataset?

Yes. Based on the definitions provided with the dataset, the acid levels could largely impact the quality. Therefore, I created an additional property, total.acids.
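The report does not show how total.acids was derived. The reported means are consistent with a sum of fixed and volatile acidity (6.8547877 + 0.2782411 = 7.1330288), so the sketch below assumes that definition. The sample rows are made up.

```r
# Hypothetical sample rows standing in for the wine data frame.
wine <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28)
)

# Assumed definition: total acids = fixed acidity + volatile acidity.
wine$total.acids <- wine$fixed.acidity + wine$volatile.acidity
wine
```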

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the properties have long right tails, and the alcohol distribution was interesting. I was also surprised by the number of properties with extreme outliers, such as residual sugar. After a little thought, though, since wines range from dry to sweet, a residual sugar mean of 6.391 and max of 65.8 grams isn’t surprising. Also, the histogram clearly shows there are no terrible or excellent wines: all wines scored between 3 and 9.

Bivariate Plots Section

Grid of Histograms by Quality

This grid of histograms by quality shows the impact of each property and also illustrates that there are few samples of lower and higher quality wines.

GG Pairs

Using the output of ggpairs, we can see the relationship between alcohol and quality. So, we’ll plot each property by alcohol and quality.

Grid of Boxplots by Quality

  1. An extreme outlier at 11% alcohol and a quality rating of 6
  2. Wines with a quality rating of 3 have the largest IQR and the widest fixed acidity range, excluding outliers
  3. As wine quality goes up, fixed acidity goes down, excluding quality 8 and 9.
  4. Wines with 10% alcohol have the largest number of outliers

  1. Higher % Alcohol wines have higher volatile acidity
  2. Quality 5 wines have the most outliers
  3. Quality 3 and 4 wines have a wide IQR
  4. 12% alcohol wines have a few outliers above 0.9, similar to quality 4 wines

  1. Citric Acid levels decrease as percent alcohol increases
  2. Excluding wines with quality of 3 and 9, the IQR becomes narrower as quality increases.

  1. Wines with higher percent alcohol have lower levels of residual sugar.
  2. One extreme outlier at quality 6 and 12% alcohol
  3. Very few outliers

  1. Wines with higher percent alcohol have lower chlorides levels
  2. Higher quality wines have lower chlorides levels
  3. Larger number of outliers

  1. Wines with higher percent alcohol have lower free sulfur dioxide levels
  2. Wines with quality ranking of 3 and 4 have a lower level of free sulfur dioxide
  3. There seems to be an interesting tie between free sulfur dioxide, alcohol, and quality

  1. Seems to be a relationship between 10% alcohol and quality rating of 5
  2. A couple of extreme outliers

  1. Clear relationship between alcohol and density, high percent alcohol equals lower density
  2. Very few outliers

  1. Excluding wines with quality of 3, higher quality wines have high pH levels.
  2. No relationship between alcohol and pH

  1. Sulphates seem to vary for quality and percent alcohol

  1. Total Acids seem to have little impact on quality

It appears higher pH, free sulfur dioxide, and alcohol have a positive impact on quality, while higher chlorides, total sulfur dioxide, density, and volatile acidity have a negative impact on quality. Based on the initial visualizations, it appears alcohol has the most significant impact. Next, we’ll look at the correlation coefficients.

Calculate Correlation Coefficient

##                                     [,1]
## fixed.acidity               -0.113662831
## volatile.acidity            -0.194722969
## citric.acid                 -0.009209091
## residual.sugar              -0.097576829
## chlorides                   -0.209934411
## free.sulfur.dioxide          0.008158067
## total.sulfur.dioxide        -0.174737218
## density                     -0.307123313
## pH                           0.099427246
## sulphates                    0.053677877
## alcohol                      0.435574715
## total.acids                 -0.136319694
## pH.bucket                    0.097255315
## rounded.free.sulfur.dioxide  0.007895089

Reviewing the correlation coefficients confirms the information observed in the boxplots.
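A one-column matrix like the one above is what `cor()` returns when a set of columns is correlated against a single vector. The following is a minimal sketch under that assumption, using a made-up sample rather than the real data; `properties` and `quality.cor` are hypothetical names.

```r
# Hypothetical sample; the real data frame has 4,898 rows.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.2, 12.8),
  density = c(1.0010, 0.9980, 0.9951, 0.9940, 0.9915, 0.9900),
  pH      = c(3.00, 3.30, 3.26, 3.19, 3.10, 3.25),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Correlate every property column with quality (Pearson by default).
# The result is a matrix with one row per property and a single column.
properties <- setdiff(names(wine), "quality")
quality.cor <- cor(wine[, properties], wine$quality)
quality.cor
```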

Percent of Wines by Quality and Colored by Percent Alcohol

This visualization clearly shows the impact of percent alcohol on quality. Selecting a wine with a percent alcohol of 12, 13, or 14 means you’ll likely select a wine with a quality of 7, 8, or 9.
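The percentages behind a chart like this can be tabulated with `prop.table()`. This sketch assumes alcohol was rounded to whole percents before plotting; the sample values and the `alcohol.bucket` and `shares` names are hypothetical.

```r
# Hypothetical sample of alcohol content (% by volume) and quality ratings.
wine <- data.frame(
  alcohol = c(9.1, 9.4, 10.2, 10.8, 11.3, 12.1, 12.6, 13.0, 12.4, 9.8),
  quality = c(5, 5, 5, 6, 6, 7, 7, 8, 6, 6)
)

# Round alcohol to whole percents so it can serve as a fill category.
wine$alcohol.bucket <- round(wine$alcohol)

# Within each quality rating (rows), the share of wines per alcohol bucket.
shares <- prop.table(table(quality = wine$quality,
                           alcohol = wine$alcohol.bucket), margin = 1)
round(100 * shares, 1)
```

Each row sums to 100%, which matches a bar chart where every quality rating is drawn as a full-height stacked bar.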

Percent of Wines by Quality and Colored by pH

Mean pH level for wines with quality less than or equal to 6

with(subset(wine,quality <= 6),mean(pH))
## [1] 3.180847

Mean pH level for wines with quality greater than or equal to 7

with(subset(wine,quality >= 7),mean(pH))
## [1] 3.215132

Although it’s somewhat difficult to see in the chart, wines with a lower quality rating (less than or equal to 6) have a lower mean pH level, while wines with a higher quality rating (greater than or equal to 7) have a higher mean pH.

Percent of Wines by Quality and Colored by Rounded Chlorides

All wines with a quality of 7, 8, or 9 have a chloride level less than or equal to 0.1. Also, lower quality wines seem to have a higher chloride level.

Percent of Wines by Quality and Colored by Rounded Total Sulfur Dioxide / 10

Lower quality wines tend to have a higher total sulfur dioxide level. Of the wines with a quality greater than 5, 67.8% (a proportion of 0.6777164) have a total sulfur dioxide level below 150.
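A share like this can be computed as the mean of a logical vector. The sketch below uses a made-up sample, and `better` and `share.low.so2` are hypothetical names.

```r
# Hypothetical sample of quality ratings and total sulfur dioxide (mg / dm^3).
wine <- data.frame(
  quality              = c(5, 6, 6, 7, 7, 8, 4, 6),
  total.sulfur.dioxide = c(170, 132, 186, 97, 120, 110, 200, 155)
)

# Keep wines rated above 5, then take the mean of a logical vector,
# which is the proportion of TRUE values.
better <- subset(wine, quality > 5)
share.low.so2 <- mean(better$total.sulfur.dioxide < 150)

# Multiply by 100 to report a percentage rather than a proportion.
sprintf("%.1f%%", 100 * share.low.so2)
```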

Percent of Wines by Quality and Colored by Rounded Density

While the boxplot showed a relationship between quality and density, it’s hard to glean any conclusion from this chart. However, it appears wines with a density between 0.98 and 0.99 score higher.

Percent of Wines by Quality and Colored by Rounded Volatile Acidity

Based on this chart, it appears high volatile acidity has a negative impact on quality.

Percent of Wines by Quality and Colored by Residual Sugar

Interesting but nothing stands out…

Property by Alcohol Bucket

Next, we’ll look at the impact of alcohol content on other wine properties.

##                                    [,1]
## fixed.acidity               -0.12088112
## volatile.acidity             0.06771794
## citric.acid                 -0.07572873
## residual.sugar              -0.45063122
## chlorides                   -0.36018871
## free.sulfur.dioxide         -0.25010394
## total.sulfur.dioxide        -0.44889210
## density                     -0.78013762
## pH                           0.12143210
## sulphates                   -0.01743277
## total.acids                 -0.11229714
## rounded.free.sulfur.dioxide -0.24688985

This set of charts illustrates the impact of the fermentation process and its relationship to the other property values. It also provides visual confirmation of the correlation coefficients above. Again, as alcohol content increases, all other property levels decrease except pH and volatile.acidity.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As it relates to quality, the visualizations and correlation coefficients indicate the following:

  1. Alcohol is the strongest factor in wine quality.
  2. Higher quality wines have lower:
  • Total Sulfur Dioxide
  • Total Acid
  • Residual Sugar
  • Chlorides
  • Volatile Acidity
  3. Higher quality wines have higher:
  • pH
  4. Density and Residual Sugar, both linked with alcohol, also show a strong relationship.

  1. Positive Correlation
  • Alcohol : 0.435574715
  • pH: 0.099427246
  • Sulphates : 0.053677877
  • Free Sulfur Dioxide: 0.008158067
  2. Negative Correlation
  • Density : -0.307123313
  • Chlorides : -0.209934411
  • Volatile Acidity : -0.194722969
  • Total Sulfur Dioxide : -0.174737218

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The negative relationship between alcohol content and density.

What was the strongest relationship you found?

The strongest relationship that impacted the quality of the wine was alcohol at 0.436. However, the strongest relationship overall was Residual Sugar and Density: 0.839.

Multivariate Plots Section

With some basic data analysis complete, the data shows a strong relationship between alcohol content and quality. Next, we’ll review the impacts of three positive properties (pH, sulphates, and free sulfur dioxide) and three negative properties (density, chlorides, and volatile acidity) when compared to quality and alcohol.

Plots with Positive Correlation Coefficient to Quality

Lower quality wines have lower alcohol and lower pH levels.

Sulphates seem to have little impact on quality.

There are 180 wines with a quality rating greater than 7. Of those wines, 158 have a percent alcohol greater than 10 and 135 have a free sulfur dioxide level greater than 20 mg.

Plots with Negative Correlation Coefficient to Quality

Lower quality wines have higher density.

Lower quality wines have higher chloride levels.

Lower quality wines have lower alcohol content and higher volatile acids.

Model

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Based on the analysis, it appears to be much easier to produce a poor quality wine. I was surprised that high chloride levels impact the wine quality. In addition, there are 180 wines with a quality rating greater than 7. Of those wines, 158 have a percent alcohol greater than 10 and 135 have a free sulfur dioxide level greater than 20 mg. It’s interesting to see that free sulfur dioxide levels have a positive impact, while high total sulfur dioxide levels have a negative impact.

Were there any interesting or surprising interactions between features?

Clearly, the fermentation process has an impact on alcohol content, which in turn impacts other properties. It would be interesting to sample wines that have a higher percent alcohol to determine the point at which alcohol content starts to have a negative impact on quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes… but, I don’t fully understand the output.

Plot One

Description One

Throughout the analysis of the white wine samples, alcohol content has been the strongest factor in wine quality. Now, let’s review the negative correlation coefficients that impact wine quality. By summing properties with a negative correlation coefficient, it becomes extremely clear how these values impact the quality of the wine. As a general observation, lower quality wines have an alcohol content of 12% or below with higher levels of total acidity, chlorides, density, and total sulfur dioxide. On the flip side, higher quality wines have lower levels of total acidity, chlorides, density, and total sulfur dioxide with an alcohol content above 12%. *Note - This chart includes wine quality 4 and above.

Plot Two

Description Two

Now, let’s review the two positive correlation coefficients that impact wine quality. By creating a ratio of free sulfur dioxide / pH, we can see most higher quality wines have a free sulfur dioxide / pH ratio between 5 and 18 and an alcohol content above 11%. *Note - This chart includes wine quality 4 and above.
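The ratio and the range described here can be sketched as follows. The sample rows are made up, and `so2.ph.ratio` and `in.range` are hypothetical names that do not come from the original analysis.

```r
# Hypothetical sample rows standing in for the wine data frame.
wine <- data.frame(
  free.sulfur.dioxide = c(45, 14, 30, 47, 8),
  pH                  = c(3.00, 3.30, 3.26, 3.19, 3.10),
  alcohol             = c(8.8, 9.5, 10.1, 11.4, 12.0)
)

# Ratio of free sulfur dioxide to pH for every wine.
wine$so2.ph.ratio <- wine$free.sulfur.dioxide / wine$pH

# Flag wines with a ratio between 5 and 18 and alcohol above 11%.
in.range <- with(wine, so2.ph.ratio >= 5 & so2.ph.ratio <= 18 & alcohol > 11)
wine[in.range, ]
```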

Plot Three

Description Three

As you can see from this chart and previous charts, when alcohol content is high, chlorides, sulphates, total acids, and residual sugar are low. This typically results in a higher quality wine. This chart provides a layperson like myself a simple method for selecting a quality wine. Selecting a wine with an alcohol content of eleven percent or higher will most likely provide the consumer an average to better than average wine tasting experience.

Reflection

Findings

I started analyzing the dataset by looking at several univariate plots, including a simple histogram and boxplot. With the exception of alcohol, density, and residual sugar, most wine properties had more outliers than I expected. The boxplot findings led me to revise the histograms, adding a density line and a vertical mean line. These additions made it easier to see that most properties have long right tails and are right-skewed.

After obtaining a general understanding of the dataset using the univariate plots, I developed several bivariate plots. First came the histograms by quality. They were visually interesting and showed that there are few high and low quality wine samples, but overall I found the graphs difficult to read. Next, the boxplots started providing insight into the relationship between wine properties and quality. The boxplots clearly showed the impact alcohol has on quality. In addition, they helped identify the negative impacts of higher chlorides, total sulfur dioxide, density, and volatile acidity on quality. After reviewing the boxplot findings, I developed several additional visualizations to confirm my findings, including Percent of Wines by Quality charts for each property with a positive or negative correlation coefficient.

Last, I created several additional multivariate plots. Using the data points provided, I confirmed that alcohol has the largest impact on quality. In addition, it appears the negative properties like density, chlorides, volatile acidity, and total sulfur dioxide have an impact on quality.

Struggles

This dataset provides nearly 5,000 observations of white wines sampled and rated by wine experts. The sample contained eleven different properties for white wine plus the quality rating. While on the surface this seemed like ample data to provide an accurate analysis, I found myself wanting more information. Primarily, an equal number of samples at each percent alcohol would help to prove that alcohol is the most significant factor in wine quality. In addition, other properties like location, variety of the grapes, and the person sampling the wine would help to provide a more complete assessment of the wines.

As a person who doesn’t drink wine on a regular basis, this analysis has led me to believe that selecting a bottle of wine with a high alcohol content will increase my chances of choosing a higher quality wine.

Reference

https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

http://stackoverflow.com/questions/24776200/ggplot-replace-count-with-percentage-in-geom-bar

http://docs.ggplot2.org/current/

http://cran.r-project.org/web/packages/RColorBrewer/index.html

http://www.realsimple.com/holidays-entertaining/entertaining/food-drink/alcohol-content-wine

http://stackoverflow.com

http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/

http://galahad.well.ox.ac.uk/repro/

http://www.frenchscout.com/types-of-white-wines

http://www.morewinemaking.com/public/pdf/wwhiw.pdf

Summary of each property by quality

Remove the eval=FALSE to print additional charts

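# Per-quality summary statistics for each of the first 11 columns.
# Adapted from http://galahad.well.ox.ac.uk/repro/
# For each property, tapply() runs summary() within every quality rating.
wineSummary <- apply(wine[, 1:11], 2, 
                     function(x) tapply(x, wine$quality, summary))
# Stack the per-quality summaries into one matrix per property.
wineSummary <- lapply(wineSummary, do.call, what = rbind)
wineSummary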

Additional Graphs

Remove the eval=FALSE to print additional charts

library(ggplot2)

# Scatter plot of two properties, colored by a third column, with a 50%
# t-distribution ellipse per quality level.
mvp <- function(yv, xv, color_var) {
  ggplot(aes_string(y = yv, x = xv), 
         data = wine) + 
    geom_point(aes_string(color = color_var)) +
    stat_ellipse(aes(color = quality.as.factor), 
                 linetype = 1, type = "t", level = 0.50) +
    # A plot has a single coordinate system; zoom to the 99th percentile
    # on both axes so extreme outliers do not dominate the view.
    coord_cartesian(ylim = c(min(wine[[yv]]), 
                             quantile(wine[[yv]], 0.99)),
                    xlim = c(min(wine[[xv]]), 
                             quantile(wine[[xv]], 0.99))) +
    ggtitle(paste(yv, " vs ", xv, "\n by Quality")) +
    scale_color_brewer(palette = "RdYlGn", guide = guide_legend(reverse = TRUE)) 
}

wine$bounded.sulfur.dioxide <- wine$total.sulfur.dioxide - 
                               wine$free.sulfur.dioxide 

# Properties to plot against each other.
properties <- c(
  "alcohol",
  "fixed.acidity",
  "volatile.acidity",
  "citric.acid",
  "residual.sugar",
  "chlorides",
  "free.sulfur.dioxide",
  "bounded.sulfur.dioxide",
  "total.sulfur.dioxide",
  "density",
  "pH",
  "sulphates",
  "total.acids"
)

# Print one plot for every ordered pair of distinct properties.
for (y in properties) {
  for (x in properties) {
    if (y != x) {
      print(mvp(y, x, "quality.as.factor"))
    }
  }
}